Abstract: We derive a kinematic variable that is sensitive to the mass of the Standard
Model Higgs boson ($M_H$) in the $H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$ channel using the
symbolic regression method. Explicit mass reconstruction is not possible in this channel
because two neutrinos escape detection. The mass determination problem is therefore that of
finding a mass-sensitive function of the measured observables. We use symbolic
regression, an analytical approach to the problem of non-linear regression, to
derive an analytic formula sensitive to $M_H$ from the two lepton momenta and the missing
transverse momentum. Using the newly derived mass-sensitive variable, we expect Higgs
mass resolutions between 1 and 4 GeV for $M_H$ between 130 and 190 GeV at the LHC with
10 fb$^{-1}$ of data. This is the first time the symbolic regression method has been applied to a
particle physics problem.
1. Introduction

In light of the current limits on the Standard Model (SM) Higgs boson mass ($M_H$), the
$H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$ channel is expected to be one of the most important
channels in the search for the Higgs boson [1, 2, 3, 4]. Direct search results at the CERN LEP
$e^+e^-$ collider place a lower limit of 114.4 GeV on $M_H$ at 95% confidence level (C.L.) [5].
Indirect constraints obtained from fits to precision electroweak data, when combined with direct
searches at LEP, place an upper bound of 157 GeV at 95% C.L. [6]. In this mass range, the
branching fraction of $H \to WW$ is sizable, and the production of $gg \to H$ through a top quark
loop has the largest cross section at both Tevatron and LHC energies. Discovery
of the Higgs boson and measurement of its properties are important for completing the
picture of the electroweak symmetry breaking mechanism. However, measurement of the mass in
$H \to WW^*$ is not trivial.

2. Mass Reconstruction in $H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$

In the $H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$ channel, two neutrinos escape detection.
The system is underconstrained, and it is not possible to determine the momenta of the two
neutrinos. Typical analyses in this channel apply selection criteria on simple kinematic
variables, and cross section upper limits are derived using the distribution of the dilepton
azimuthal opening angle, which reflects the spin nature of the Higgs boson [?]. To
increase the sensitivity of the searches and to measure the mass, it is desirable to have a
variable that carries direct information on the mass of the Higgs boson ($M_H$).

There are a couple of kinematic variables that can be used for mass reconstruction in
the $H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$ channel [?, ?]. They are either generalizations of
the transverse mass ($M_T$) or modifications of solutions to kinematic problems in
supersymmetric (SUSY) models. These variables depend linearly on $M_H$ and are sensitive to it.
While they are motivated by the kinematics of the event, it is not clear in what sense they
are optimal.

We approach the problem of $M_H$ measurement from a different perspective. Since the
kinematics of an event reflect $M_H$, we should be able to extract $M_H$ from the lepton momenta
and the missing transverse momentum, which contain the maximal available information. We would
like to construct a function $F$ such that, on average, $\langle F(x(m)) \rangle = m$, where $x$
denotes the quantities measured by a detector. Since there are infinitely many such functions,
some optimization condition is necessary to arrive at an optimal one. This is a problem of
non-linear multivariate regression.

There are advanced analysis techniques which allow us to construct a multivariate
function $\hat{F}(x)$ that can be regarded as an approximation to the ideal function $F(x)$.
The function is built from a training data set such that $\hat{F}(x) \approx F(x)$. Its
performance can be evaluated on a test sample using the variance or some other measure of error.
However, most of these methods are black boxes: even if we wrote down the solution, we would
not be able to make much sense of it. These methods also do not generalize
sufficiently, and undesired biases show up for data close to the boundaries of the input
variables. Instead, we take an analytic approach to function approximation called symbolic
regression.

3. Symbolic Regression and Application to $M_H$ Reconstruction
3.1 Symbolic Regression 

In symbolic regression, a function which minimizes certain criteria (or maximizes fitness)
is constructed analytically from the input variables [?, 10]. Symbolic regression is
an application of genetic programming methodology and genetic algorithms. Symbolic regression
methods are powerful enough to derive invariants, such as the Hamiltonian, from a
set of experimental data with non-linearities [11]. An advantage of symbolic regression over
purely numerical methods is its interpretability.

Genetic algorithms are often employed in optimization problems with many parameters.
They apply the principles of natural evolution to the computational domain. In a genetic
algorithm, a set of individuals forms a population, where each individual is represented by a
gene. Each position in a gene can be used to encode some strategy or functionality. The
behavior of an individual, also known as its phenotype, is determined by its genetic
constitution, or genotype. The fitness of an individual is evaluated at the level of the
phenotype.

If two individuals with different genotypes show the same phenotype, then they have
identical fitness. Fitness is a measure of how well an individual achieves the desired goal.
In genetic algorithms, fitness is explicitly defined, unlike in the
biological world, where fitness is implicit.

Evolution of the population is achieved through genetic operations which create new
genotypes or modify existing ones. Cross-over and mutation are genetic operators
that can be used to create individuals for successive generations. The cross-over operation
(sexual reproduction) is applied to a pair of parent genes to create a child gene. An individual
can also undergo a point mutation (asexual reproduction) in a gene. The strength of the genetic
algorithm approach comes from the randomness of the genetic operations and the variety of genes
present in a population; because of this randomness, local minima in the fitness landscape
can be avoided. Genetic algorithms are finding their way into high energy physics
in optimization problems [13]. For this study, we created our own genetic algorithm
package to overcome the limitations of existing tools.

Symbolic regression is a method that can be applied to problems where we want to map
input variables to an output. It arrives at the answer using genetic algorithms.
In symbolic regression, analytic equations form the population. An individual is usually
encoded as a binary expression tree (Fig. 1), but other representations are possible.
A linear encoding would make the genes look closer to their biological counterpart, but this
is not necessary. Evaluation of fitness and manipulation of the genes are much more efficient
in a binary tree representation. Internal nodes of a binary tree are operators or functions,
and terminal nodes are either numerical constants or variables. The set of operators,
variables, and constants, and the fitness function or minimization criteria, must be defined
for each problem.

Figure 1: A binary tree of the expression (v1 + v2) * v3.
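
As a minimal illustration of the binary-tree encoding, the sketch below builds the expression of Fig. 1, (v1 + v2) * v3, and evaluates it. The `Node` class, the operator table `OPS`, and the numerical values are hypothetical choices made for this example; they are not taken from the package used in this study, and unary functions such as sin or log are omitted for brevity.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Binary operators allowed at internal nodes (illustrative subset).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass
class Node:
    """One node of an expression tree: an operator, a variable name, or a constant."""
    value: object
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def evaluate(self, env):
        # Terminal node: look up a variable by name, or return the constant itself.
        if self.left is None and self.right is None:
            return env[self.value] if isinstance(self.value, str) else self.value
        # Internal node: apply the operator to the two evaluated branches.
        return OPS[self.value](self.left.evaluate(env), self.right.evaluate(env))

    def __str__(self):
        if self.left is None:
            return str(self.value)
        return f"({self.left} {self.value} {self.right})"


# The tree of Fig. 1: (v1 + v2) * v3
tree = Node("*", Node("+", Node("v1"), Node("v2")), Node("v3"))
result = tree.evaluate({"v1": 1.0, "v2": 2.0, "v3": 4.0})
```

Because every sub-expression is itself a tree, both evaluation and the genetic operations reduce to simple recursive walks over the nodes.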

An initial population is built randomly from a given set of operators, variables, and
constants. Individuals of subsequent generations are created either by applying the cross-over
operation to a pair of "father" and "mother" equations to yield a "child" equation
(Fig. 2) or by mutating existing expressions. The selection of parents can vary among
implementations. In this study, each parent is selected through a tournament. A tournament
is held among a small, randomly selected pool from the population, and the best-fit
individual is chosen. Through tournaments, fitter individuals have a greater chance to pass
on parts of their genes.
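
The tournament selection and cross-over steps described above can be sketched as follows. This is only an illustration under stated assumptions: expressions are stored as nested tuples `(operator, left, right)`, a lower fitness value is taken to mean a fitter individual, and the function names and the example "mother" tree are hypothetical rather than taken from the package used in this study.

```python
import random


def subtree_paths(tree, path=()):
    """Yield the path (tuple of child indices) to every node of a nested-tuple tree."""
    yield path
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from subtree_paths(tree[i], path + (i,))


def get_subtree(tree, path):
    """Return the branch found by following `path` from the root."""
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of `tree` in which the branch at `path` is replaced by `new`."""
    if not path:
        return new
    children = list(tree)
    children[path[0]] = replace_subtree(tree[path[0]], path[1:], new)
    return tuple(children)


def crossover(father, mother, rng):
    """Graft a randomly chosen branch of the mother onto a random point of the father."""
    f_path = rng.choice(list(subtree_paths(father)))
    m_path = rng.choice(list(subtree_paths(mother)))
    return replace_subtree(father, f_path, get_subtree(mother, m_path))


def tournament(population, fitness, pool_size, rng):
    """Pick the best-fit individual from a small random pool (lower value = fitter)."""
    return min(rng.sample(population, pool_size), key=fitness)


# A cross-over in the spirit of Fig. 2: v3 in the father is replaced by (v4 - v1).
father = ("*", ("+", "v1", "v2"), "v3")
mother = ("+", "v2", ("-", "v4", "v1"))      # hypothetical mother expression
child = replace_subtree(father, (2,), get_subtree(mother, (2,)))
```

A small `pool_size` keeps the selection pressure mild, since even a mediocre individual wins a tournament whenever no better one happens to be drawn into its pool.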

The point mutation operation is applied randomly, with small probability, to each node of an
individual. This is independent of sexual reproduction. Point mutation mimics the random
mutations that occur in biological processes. The effect of mutation is diversification of
the gene pool. Although a random mutation may make an individual less fit, it may still
be beneficial when an offspring inherits the mutation.
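
A point mutation of this kind might look like the sketch below, again assuming the nested-tuple encoding `(operator, left, right)`. The operator and variable sets, the mutation rate, and the function name are illustrative assumptions, not the choices made in this study.

```python
import random

OPERATORS = ("+", "-", "*")                 # hypothetical operator set
VARIABLES = ("v1", "v2", "v3", "v4")        # hypothetical variable set


def point_mutate(tree, rate, rng):
    """Visit every node and, with probability `rate`, swap it for a random
    symbol of the same kind, so the shape of the tree is preserved."""
    if isinstance(tree, tuple):
        op, left, right = tree
        if rng.random() < rate:
            op = rng.choice(OPERATORS)
        return (op, point_mutate(left, rate, rng), point_mutate(right, rate, rng))
    if rng.random() < rate:
        return rng.choice(VARIABLES)
    return tree


original = ("*", ("+", "v1", "v2"), "v3")
unchanged = point_mutate(original, 0.0, random.Random(0))   # rate 0: no node is touched
scrambled = point_mutate(original, 1.0, random.Random(0))   # rate 1: every node is redrawn
```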

At each generation, individuals are sorted according to their fitness, and those with
poor fitness are discarded so that the population size stays constant. The best-fit individuals
("the elite") are passed along to the next generation without modification, but they can
still participate in sexual reproduction. The number of generations, or another termination
criterion, has to be chosen as a parameter of the algorithm. The implementation is
discussed in detail elsewhere [14].
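
One possible reading of this generation update is sketched below: children are added to the current population, everything is ranked by fitness, and the list is truncated back to its original size, so the best individuals survive unchanged. The function names, the fixed number of generations as the termination criterion, and the toy individuals (plain numbers instead of expression trees) are assumptions made for illustration only.

```python
def next_generation(population, fitness, n_children, make_child):
    """Add children, rank everyone by fitness (lower = fitter), keep the size constant."""
    children = [make_child(population) for _ in range(n_children)]
    ranked = sorted(population + children, key=fitness)
    return ranked[:len(population)]


def evolve(population, fitness, n_children, make_child, n_generations):
    """Repeat the update for a fixed number of generations (the termination criterion here)."""
    for _ in range(n_generations):
        population = next_generation(population, fitness, n_children, make_child)
    return population


def halve_best(population):
    """Toy reproduction rule: a child is half of the current best individual."""
    return min(population, key=abs) / 2


# Toy problem: individuals are numbers and fitness is the distance from zero.
toy_population = [5.0, 3.0, 8.0]
one_step = next_generation(toy_population, abs, 2, halve_best)
final = evolve(toy_population, abs, 2, halve_best, 3)
```

In a real symbolic regression run, `make_child` would combine tournament selection, cross-over, and mutation, and `fitness` would evaluate each expression on the training sample.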

In a more traditional method of optimization, a minimum is reached by descending the
fitness landscape smoothly through incremental changes. In a genetic algorithm,
genetic operations introduce local changes in the genes, but the behavior of the child can
be quite different from that of its parents (Fig. 2). It is understood that genetic algorithms
can therefore probe the fitness landscape more globally. Maintaining genetic diversity is
crucial to success, since genetic algorithms can still be trapped in local minima if there is
little genetic diversity. Applying strong selection pressure on the population, for example by
having a large fraction of the population participate in each tournament, is not necessarily
beneficial, since it can effectively reduce the gene pool to that of a few fit individuals.

The physical dimensions of formulae produced by symbolic regression may not be
correct. This is also true of traditional multivariate regression algorithms. However, in
symbolic regression we can control the terms that can appear in an expression. In this study,
we created a dimensionally constrained symbolic regression (DCSR), in which only terms that are
dimensionally correct can appear. For example, in a DCSR, terms like $p_x + p_y^2/(p_x + p_z)$
can be present, but not $p_x + p_y^2$. In a DCSR, cross-over operations can only occur
between branches with the same physical dimension.
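
The dimensional constraint can be illustrated with a small checker that assigns a mass dimension to every branch of an expression. This is a sketch under assumptions: expressions are nested tuples `(operator, left, right)`, momenta carry dimension 1, numerical constants carry dimension 0, and the function names and example expressions are hypothetical rather than taken from the DCSR implementation.

```python
def dimension(tree, var_dims):
    """Return the mass dimension of an expression, or None if it is inconsistent."""
    if not isinstance(tree, tuple):
        # Terminal node: a named variable or a dimensionless numerical constant.
        return var_dims[tree] if isinstance(tree, str) else 0
    op, left, right = tree
    d_left, d_right = dimension(left, var_dims), dimension(right, var_dims)
    if d_left is None or d_right is None:
        return None
    if op in ("+", "-"):
        # Sums and differences are only allowed between equal dimensions.
        return d_left if d_left == d_right else None
    if op == "*":
        return d_left + d_right
    if op == "/":
        return d_left - d_right
    raise ValueError(f"unknown operator: {op}")


def can_cross_over(branch_a, branch_b, var_dims):
    """Branches may be exchanged only if both are valid and share the same dimension."""
    d_a, d_b = dimension(branch_a, var_dims), dimension(branch_b, var_dims)
    return d_a is not None and d_a == d_b


DIMS = {"px": 1, "py": 1, "pz": 1}
py_squared = ("*", "py", "py")                          # dimension 2
allowed = ("+", "px", ("/", py_squared, ("+", "px", "pz")))
forbidden = ("+", "px", py_squared)                     # adds dimension 1 to dimension 2
```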

In tests on simple problems where we know the optimal answer, such as mass
determination in $W \to \ell\nu$, DCSR, as well as the normal symbolic regression, is able to
arrive at an equation that differs from the well-known transverse mass ($M_T$) only by a
multiplicative factor. However, for more complicated problems, DCSR solutions converge much
more rapidly, and in some cases only DCSR produces satisfactory solutions.
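
For reference, the transverse mass mentioned above is usually written, for a massless charged lepton and neutrino, in the standard textbook form below. It is quoted here as background rather than as an output of the symbolic regression; $\Delta\phi_{\ell\nu}$ denotes the azimuthal angle between the lepton and the missing transverse momentum.

```latex
M_T^2 = 2\, p_T^{\ell}\, \not{E}_T \left( 1 - \cos\Delta\phi_{\ell\nu} \right)
```

An expression that differs from $M_T$ only by a multiplicative constant carries the same information about the mass.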




Figure 2: Creation of a child through the cross-over operation. The cross-over of genes
occurring between v3 and v4 - v1 from the two parents yields a new child (v1 + v2) * (v4 - v1).



3.2 Symbolic Regression Applied to $H \to WW^*$

Higgs mass determination in $H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$ at hadron colliders is an
important problem. In this channel, the two lepton momenta $\vec{p}_{\ell_1}$, $\vec{p}_{\ell_2}$
and the vector sum of the two neutrino transverse momenta,
$\vec{\not{E}}_T = \vec{p}_{T\nu_1} + \vec{p}_{T\nu_2}$, are measured in experiments. Since there
are only two equations related to the neutrino momenta, the system is under-constrained. Even if
we knew that both W bosons were real, we would still need two extra equations to constrain the
system. Therefore one cannot solve for the neutrino momenta exactly, even in principle.

Existing studies relied on analysis of the kinematics to find expressions that behave linearly
with the Higgs boson mass [?]. In this study, we approach the problem as that of finding
an expression that not only shows linear behavior but also yields a narrow
mass distribution.

Symbolic regression is applied to data generated with PYTHIA for
$pp \to H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$ at $\sqrt{s} = 14$ TeV, with $M_H$ varying from
120 GeV to 200 GeV. Detector simulation is not applied to the data. The momentum components and
energies of the two charged leptons
($p_{1T}, p_{1x}, p_{1y}, p_{1z}, E_1, p_{2T}, p_{2x}, p_{2y}, p_{2z}, E_2$) and the missing $E_T$
information ($\not{E}_T, \not{E}_x, \not{E}_y$) are used as input variables for the symbolic
regression. The fitness function used is the average fractional absolute difference,
$\frac{1}{N}\sum_i |M_{\mathrm{rec},i} - M_{H,i}| / M_{H,i}$.
Without DCSR, the symbolic regression is not able to yield meaningful results. This
seems to be due to the large number of variables used. The number of terms of dimension
2 with only multiplication allowed is 78. In our implementation, the four basic arithmetic
operators ($+, -, \times, \div$) and transcendental functions (sin, cos, log, exp) are allowed,
which makes the number of possible terms infinite.

If the fractional root-mean-square (RMS) deviation,
$\left[\frac{1}{N}\sum_i (M_{\mathrm{rec},i} - M_{H,i})^2 / M_{H,i}^2\right]^{1/2}$, were used as
the fitness function, the symbolic regression would get trapped in local minima even with
DCSR. This is consistent with what is known about genetic algorithms, since outliers incur
a heavy penalty with such a fitness function. Overall, it has the effect of reducing
diversity.
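
The difference between the two candidate fitness functions can be seen on a toy sample. The sketch below implements the mean fractional absolute difference and the fractional RMS as defined in the text; the function names and the four-event sample, in which a single event is reconstructed at twice the true mass, are made up for illustration.

```python
import math


def mean_fractional_abs(m_rec, m_true):
    """Average of |M_rec - M_H| / M_H over the sample."""
    return sum(abs(r - t) / t for r, t in zip(m_rec, m_true)) / len(m_true)


def fractional_rms(m_rec, m_true):
    """Square root of the average of ((M_rec - M_H) / M_H)^2 over the sample."""
    return math.sqrt(sum(((r - t) / t) ** 2 for r, t in zip(m_rec, m_true)) / len(m_true))


# Toy sample: three perfectly reconstructed events and one large outlier.
true_masses = [160.0, 160.0, 160.0, 160.0]
reconstructed = [160.0, 160.0, 160.0, 320.0]

abs_score = mean_fractional_abs(reconstructed, true_masses)   # 0.25
rms_score = fractional_rms(reconstructed, true_masses)        # 0.5
```

The single outlier costs twice as much under the RMS measure in this example, which shows how a squared penalty lets a few badly reconstructed events dominate the ranking of individuals.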

Figure 3 shows the evolution of the fitness of the best-fit individuals in 100 runs as a
function of the number of generations. DCSR is able to converge on meaningful results and yields
the best estimate for $M_H$ as

$$S_{\text{mass}}^2 = 2p_{1T}^2 + 2p_{2T}^2 + 3\left(p_{1T}\,p_{2T} + \not{E}_T\,(p_{1T} + p_{2T}) - \vec{\not{E}}_T \cdot (\vec{p}_{1T} + \vec{p}_{2T}) - 2\,\vec{p}_{1T} \cdot \vec{p}_{2T}\right).$$

The symmetry between the two leptons in the system is recognized by the symbolic regression
automatically, even though no symmetry condition was imposed.




Figure 3: Left: Evolution of the fitness of the best-fit individuals in 100 runs. Right:
Distribution of the ratio of predicted mass to true mass, $m_{\text{pred}}/m_H$, versus the true
Higgs mass $m_H$. The sample was generated using PYTHIA.

For a Higgs boson of $M_H = 160$ GeV which decays to two real W bosons, if the charged
leptons both travel in the same direction, transverse to the beam with no longitudinal
momentum components, $S_{\text{mass}} = \frac{5}{4} M_H$. In the other extreme case, where
$\not{E}_T = 0$ and the two lepton momenta are opposite to each other in the transverse plane,
$S_{\text{mass}} = \frac{\sqrt{13}}{4} M_H \approx 0.90\, M_H$.
Other configurations of the lepton momenta and $\not{E}_T$ yield different values of
$S_{\text{mass}}$. Since we are assuming perfect knowledge of the lepton momenta and $\not{E}_T$,
the width of the distribution reflects the fact that some of the information on the two
neutrinos is irretrievably lost. The distribution of $S_{\text{mass}}$ shows good fractional RMS
(Fig. 4). The mass resolution depends not only on the RMS but also on the shape of the
distribution; this is described in a later section under more realistic conditions. By
replacing $p_{1T}p_{2T}$ with $2p_{1T}p_{2T}$ in $S_{\text{mass}}$, one
can get a variable with a mean closer to $M_H$, but its distribution is wider.

Figure 4: Distribution of reconstructed mass for $S_{\text{mass}}$ (top) for $M_H = 140$ GeV for
$pp$ collisions at $\sqrt{s} = 14$ TeV.
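
The two limiting configurations can be checked numerically with a short sketch of the mass variable. The function below is one reading of the formula for the mass variable given in this section, taking transverse momentum components in GeV; the function name and the assumption that each charged lepton carries roughly $M_W/2 \approx 40$ GeV when both W bosons decay at rest (as for $M_H = 160$ GeV) are illustrative choices, not part of the original analysis.

```python
import math


def s_mass(p1, p2, met):
    """Mass variable from the transverse (x, y) components of the two leptons and the MET."""
    p1t, p2t, et = math.hypot(*p1), math.hypot(*p2), math.hypot(*met)

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1]

    lepton_sum = (p1[0] + p2[0], p1[1] + p2[1])
    s2 = (2 * p1t**2 + 2 * p2t**2
          + 3 * (p1t * p2t + et * (p1t + p2t)
                 - dot(met, lepton_sum)
                 - 2 * dot(p1, p2)))
    return math.sqrt(s2)


M_H = 160.0
# Leptons parallel in the transverse plane; both neutrinos recoil against them.
parallel = s_mass((40.0, 0.0), (40.0, 0.0), (-80.0, 0.0))
# Leptons back to back; the two neutrino momenta cancel, so the MET vanishes.
back_to_back = s_mass((40.0, 0.0), (-40.0, 0.0), (0.0, 0.0))

ratio_parallel = parallel / M_H            # 1.25
ratio_back_to_back = back_to_back / M_H    # about 0.90
```

The spread between these two ratios gives a feel for the intrinsic width of the distribution even with perfectly measured leptons and missing transverse momentum.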

Since the simulated data used to derive the equation were from $pp$ collisions at
$\sqrt{s} = 14$ TeV, it is worth examining how $S_{\text{mass}}$ would perform in a different
scenario, such as the Tevatron, where $p\bar{p}$ collide at $\sqrt{s} = 1.96$ TeV. The Higgs
bosons at the Tevatron are expected to be produced with a smaller boost, and the kinematics of
the final state particles are different. Fortunately, even in this case, the $S_{\text{mass}}$
variable shows linearity and similar fractional RMS (Fig. 5). Therefore, we conclude that
$S_{\text{mass}}$ captures genuine features of the
$H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$ system.

3.3 Mass Sensitivity of the $S_{\text{mass}}$ Variable at the LHC

The sensitivity of a variable to mass depends on the shape of the mass distributions for the
signals. In order to study the sensitivity of the mass variable under more realistic
conditions, we generated $pp \to H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$ events at
$\sqrt{s} = 14$ TeV using the MadEvent generator with PGS v4 detector simulation and
reconstruction [16]. We assume $pp \to WW^*$ and $pp \to t\bar{t}$ as backgrounds. We generated
the simulated signal samples in the mass range between 120 GeV/$c^2$ and 200 GeV/$c^2$ at
2.5 GeV/$c^2$ intervals. Both the signal and background samples are scaled to the NLO cross
sections by applying appropriate K-factors [?, ?, ?].



Figure 5: Left: the linearity of $S_{\text{mass}}$ in
$p\bar{p} \to H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$ at the Tevatron energy
$\sqrt{s} = 1.96$ TeV. Right: $S_{\text{mass}}$ distribution for $M_H = 140$ GeV.

The selection criteria are as follows:

• Two leptons with $p_T > 15$ GeV and $|\eta| < 2.5$

• 12 GeV < $M_{\ell\ell}$ < 300 GeV

• $\not{E}_T > 30$ GeV

• $M_{T2} > 50$ GeV

• $|\Delta\phi_{\ell\ell}| < 1.8$

• No hadronic jets with $p_T > 20$ GeV
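
The selection listed above can be expressed as a single predicate on an event record. The dictionary layout and key names below (`leptons`, `m_ll`, `met`, `mt2`, `dphi_ll`, `jet_pts`) are hypothetical and chosen only for this sketch; a real analysis would read these quantities from its own event format.

```python
def passes_selection(event):
    """Apply the dilepton selection; energies and momenta in GeV, angles in radians."""
    leptons = event["leptons"]
    if len(leptons) != 2:
        return False
    # Both leptons must satisfy pT > 15 GeV and |eta| < 2.5.
    if any(lep["pt"] <= 15.0 or abs(lep["eta"]) >= 2.5 for lep in leptons):
        return False
    if not 12.0 < event["m_ll"] < 300.0:      # dilepton invariant mass window
        return False
    if event["met"] <= 30.0:                  # missing transverse energy
        return False
    if event["mt2"] <= 50.0:                  # M_T2 requirement
        return False
    if abs(event["dphi_ll"]) >= 1.8:          # dilepton azimuthal opening angle
        return False
    if any(pt > 20.0 for pt in event["jet_pts"]):   # jet veto
        return False
    return True


signal_like = {
    "leptons": [{"pt": 30.0, "eta": 0.5}, {"pt": 20.0, "eta": -1.2}],
    "m_ll": 45.0, "met": 40.0, "mt2": 60.0, "dphi_ll": 1.0, "jet_pts": [12.0],
}
with_hard_jet = dict(signal_like, jet_pts=[12.0, 25.0])
low_met = dict(signal_like, met=20.0)
```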

The last three criteria reduce the $t\bar{t}$ background significantly and the $WW^*$ background
moderately. $M_{T2}$ is a good variable to use, since it increases the signal-to-background
ratio [?]. The $S_{\text{mass}}$ variable is weakly correlated with $M_{T2}$, and for larger
values of $M_{T2}$ the $S_{\text{mass}}$ distribution becomes sharper. The selection has the
effect of removing events with smaller values of $S_{\text{mass}}$, where backgrounds are
copious. The $S_{\text{mass}}$ distributions for various values of $M_H$ are shown in Fig. 6.

To take into account theoretical and experimental uncertainties, a 10% uncertainty in
the overall normalization is assumed. To evaluate the uncertainty in the mass determination,
templates of the signal and background $S_{\text{mass}}$ distributions are used to conduct
pseudo-experiments. The log-likelihood is calculated for each mass hypothesis and then fitted
with a parabola to extract the mass resolution (Fig. 7). The mass resolution is obtained from
the parabola at $-\ln(\mathcal{L}/\mathcal{L}_{\max}) = \frac{1}{2}$ (Fig. 8). The mass
resolution improves from 3.7 GeV to 1.3 GeV as on-shell decay of the Higgs boson becomes
possible. With $M_H$-dependent cuts, including one on $M_{T2}$, the mass resolution improves
slightly [?]. The $S_{\text{mass}}$ variable is correlated with
other mass variables for $H \to WW$, but the correlation is not 100%. Therefore, further
improvement may be possible by forming suitable combinations of the variables.
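
The parabola step of the resolution extraction can be sketched as follows. The code fits a parabola to the negative log-likelihood values of a scan over mass hypotheses by least squares and reads off the mass at the minimum and the half-width at which the negative log-likelihood rises by 1/2. The function names and the synthetic likelihood scan are assumptions made for this example; the template construction and pseudo-experiments of the actual analysis are not reproduced.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_mass(masses, nll):
    """Least-squares parabola fit of -ln L versus mass; returns (best mass, resolution)."""
    x0 = sum(masses) / len(masses)                 # center the scan for numerical stability
    u = [m - x0 for m in masses]
    s = [sum(ui**k for ui in u) for k in range(5)]                        # sums of u^k
    t = [sum(ui**k * yi for ui, yi in zip(u, nll)) for k in range(3)]     # sums of u^k * y
    # Normal equations for y = a*u^2 + b*u + c, solved with Cramer's rule.
    lhs = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(lhs)
    coeffs = []
    for col in range(3):
        m = [row[:] for row in lhs]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / d)
    a, b, _ = coeffs
    best_mass = x0 - b / (2 * a)
    resolution = 1 / math.sqrt(2 * a)   # -ln L rises by 1/2 when a * dm^2 = 1/2
    return best_mass, resolution


# Synthetic scan: a perfect parabola centred at 160 GeV with a width of 1.3 GeV.
scan_masses = [150.0 + 2.5 * i for i in range(9)]
scan_nll = [(m - 160.0) ** 2 / (2 * 1.3**2) + 5.0 for m in scan_masses]
fitted_mass, fitted_sigma = fit_mass(scan_masses, scan_nll)
```

Averaging the fitted mass and resolution over many pseudo-experiments would then give a linearity and resolution estimate for each mass point.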

4. Conclusions 

Symbolic regression is used to derive a kinematic variable which is sensitive to the mass
of the Higgs boson in the $H \to WW^* \to \ell^+\nu\ell^-\bar{\nu}$ channel at hadron colliders.
With this variable, the mass of the Higgs boson can be measured with an accuracy of 1 to 4 GeV in
the Higgs mass range between 130 GeV and 190 GeV at the LHC with 10 fb$^{-1}$ of data.
This is the first time the symbolic regression method has been applied to a high-energy physics
problem.

Figure 6: $S_{\text{mass}}$ distributions of the Higgs signal for various $M_H$ expected in
10 fb$^{-1}$ at the LHC after event selection.

Figure 7: Left: Template histograms of $t\bar{t}$ (dark), $WW^*$ (light), and $H \to WW^*$ with
$M_H = 140$ GeV (white) for 10 fb$^{-1}$. The error bars show what typical data would look like.
Right: Log-likelihood as a function of mass for a 160 GeV Higgs boson.

Figure 8: Linearity (top) and mass resolution (bottom) expected as obtained from
pseudo-experiments in 10 fb$^{-1}$ for $pp$ collisions at $\sqrt{s} = 14$ TeV.
$M_H$ (GeV)             130    140    150    160    170    180    190
$\sigma_{M_H}$ (GeV)    3.7    2.4    1.7    0.8    1.2    1.5    1.8

Table 1: Expected Higgs mass resolutions using the $S_{\text{mass}}$ variable with 10 fb$^{-1}$
of data in $pp$ collisions at $\sqrt{s} = 14$ TeV. The last row shows mass resolutions expected
for mass-dependent optimized analyses.